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Summary: Sequencing-based studies are emerging as a major tool for genetic association studies of complex diseases. These 
studies pose great challenges to the traditional statistical methods because of the high-dimensionality of data and the low 
frequency of genetic variants. Moreover, there is a great interest in biology and epidemiology to identify genetic risk factors 
contributed to multiple disease phenotypes. The multiple phenotypes can often follow different distributions, which brings an 
additional challenge to the current statistical framework. In this paper, we propose a generalized similarity U test, referred 
to as GSU. GSU is a similarity-based test that can handle high-dimensional genotypes and phenotypes. We studied the 
theoretical properties of GSU, and provided the efficient p-value calculation for association test as well as the sample size 
and power calculation for the study design. Through simulation, we found that GSU had advantages over existing methods 
in terms of power and robustness to phenotype distributions. Finally, we used GSU to perform a multivariate analysis of 
sequencing data in the Dallas Heart Study and identified a joint association of 4 genes with 5 metabolic related phenotypes. 
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1. Introduction 

Genome-wide association studies (GWAS) have made sub¬ 
stantial progress in discovering common genetic variants 
associated with complex diseases. Despite such success, a 
large proportion of heritability of complex diseases remains 
unexplained. Converging evidence has suggested that rare 
variants with minor allele frequency (MAF) less than 5% or 
1% hold promise in accounting for a significant proportion 
of the missing heritability (|Fay et al.| 2001| Pritchard 2001 


|Kryukov et al.||2007| |Boyko et al.||2008| ). With the advance of 
next-generation sequencing (NGS) technology, we have now a 
unique opportunity to investigate the role of a wider scope of 
genetic variants, primarily rare variants, in human diseases. 
Evidence in early sequencing studies have already shown that 
rare variants played an important role in complex diseases 
(|Cohen et al.|[2004||Ahituv et ah||2007||ji et al.[|2008||Romeo| 


et al. 2009). Although promising, the massive data generated 


from sequencing studies poses great challenges on the statis¬ 
tical methods. Rare mutations are recent mutations, and can 
be only found in a small fraction of individuals in the entire 
population. Even with a large effect size, a rare variant is hard 
to be detected because of its low MAF. Moreover, the massive 
number of rare variants raises the computational burden and 
the multiple comparison issue. Therefore, traditional single¬ 
locus analysis has low power to detect rare variants. 

Many new statistical methods have been developed for the 
sequencing data in the last few years. Different from the 


traditional single-locus analysis, most of the new methods 
perform a joint association test, namely, testing the joint 
effect of a set of single nucleotide variants (SNVs) on a 
genomic region, a functional unit (e.g., a gene) or a functional 
pathway. The advantage of the joint association test over 
the single-locus analysis lies in the fact that, by combining 
multiple SNVs, not only the association information is ag¬ 
gregated but also the number of tests is greatly reduced. 
The joint association tests can be briefly classified into two 
categories: burden tests and non-burden tests. Burden tests 
first summarize multiple rare variants into a univariate genetic 
score, and then test the association of the summary score 


with the disease phenotype.)Morgenthaler and Thilly 2007 


Li and Leal | 2008; Madsen - and Browning||2009[|Lin and Tang 


2011). Burden tests often assume the effect of the multiple 


variants have similar magnitude and direction. Non-burden 
tests, on the other hand, can take into account of the effect 
heterogeneity within the SNV-set, by considering the effect 
of the multiple variants as random effects or function of 


genomic position.(Neale et al. 2011 Wu et al. 2011 Luo 


|et al.[ J2012}. Among non-burden methods, sequence kernel 
association test (SKAT) shares many nice computational and 
asymptotic properties with the variance component score test, 
and has been very popularly used( |Wu et al.[[201 1). 

Most of existing joint association tests are parametric- 
based methods, which often rely on certain assumptions (e.g., 
a normal distribution assumption). When the assumptions are 
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violated, they are subjected to power loss or an inflated type I 
error. Moreover, there lacks the development of new methods 
for the multivariate analysis of sequencing data, especially 
when the phenotype follows different different distributions 
(e.g., some phenotypes are binary while the others are con¬ 
tinuous). The study of multiple phenotypes is important in 
biomedical research. Many studies collect multiple biochemi¬ 
cal measurements related to a disease of interest. These mul¬ 
tiple measurements evaluate different aspects of the disease, 
and thus better reflect the underlying biological mechanism of 
the disease. It also becomes popular for human genetic studies 
to collect and study multiple disease phenotypes. For instance, 
studies have identified common gene variants contributed 


to co-morbidity of substance dependence!Dick and Agrawal 
[2008] ). 


To achieve these goals, we propose a Generalized Similarity 
U test, referred to as GSU. We use two different similarity 
measurements to summarize genetic information and pheno¬ 
typic information, and then form a test under the weighted 
U framework. We derive the asymptotic distribution of the 
test statistic and implement the new method in R. The 
proposed method has several remarkable features: 1 ) it is non- 
parametric and is thus robust to phenotype distributions; 2 ) it 
can handle multiple different phenotypes (e.g., a combination 
of binary and continuous phenotypes); 3)it has a nice statis¬ 
tical property and performs well under small sample sizes. 
Simulation studies are conducted to compare the type 1 error 
and the power of our method with those of several commonly 
used methods. Finally, we applied the new method to the 
Dallas Heart Study to test the association of 4 candidate genes 
with 5 metabolic-related phenotypes. 


2. Generalized Similarity U 

2.1 Weighted U Statistic 

Suppose that n subjects are sequenced in a study, where 
we are interested in testing the association of L phenotypes 
{yi,i, l^l^L) with M genetic variants 

M). For each subject i, we observe a 
vector of phenotypes ( yi = (j/i,i, yi, 2 , ■ ■ ■ ,yi,L ) ) and a 
vector of genotypes gi ( g t = - - - ,5 ;,m))- In the 

special case when L = 1 (or M = 1), it is simplified to a 
univariate analysis (or a single-locus analysis). When L > 1, 
it extends to a multivariate analysis. In GSU, we allow 
multiple phenotypes to be of different types (e.g., continuous 
or categorical), and do not assume any distribution of 
phenotypes. The number of genetic variants M and the 
number of phenotypes L can be larger than the sample size. 
For example, the genetic data can be sequencing data (high 
demensional genotypes) and the phenotype data can be 
imaging data (high demensional phenotypes). 

Given the phenotypes and the genotypes for the subjects i 
and j ,we first define their phenotype similarity Sij by, 


Si,j = HViiVi), 

and define their genetic similarity K z . :l by, 


Ki,j = f{gi,gj)- 


The similarity measurements h(- , ■) and /(•,■) defined above 
can be of a general form as long as they satisfy the fi¬ 
nite second moment condition (i.e.,E(h 2 (Yi, Y 2 )) < 00 and 
_E(/ 2 (Gi,G 2 )) < 00 ). We further center the phenotype 
similarity by, 


Sij h(yi,yj) 

= Kvuyj) - E(h(.yi,Yj)) 

-EihiYi^ + EihfYi.Yj)), ( 1 ) 


and center the genetic similarity, Kij = f(gi,gj), in a similar 
manner. The generalized similarity U (GSU) is then defined 
as the summation of the centered phenotype similarities 
weighted by the centered genetic similarities, 

( 2 ) 


where the Kij is considered as the weight function and the 
Sij is considered as the U kernel. In our definition of GSU, 
the role of genetic similarity and phenotype similarity are 
interchangeable. In other words, we can also treat Sij as the 
weight function and Kij as the U kernel. 


2.2 Similarity Measurement 

The choices for the phenotype similarity h(- , ■) and the 
genetic similarity /(•, ■) are flexible. According to different 
types of genetic variants and the purpose of the analysis, 
we can choose different types of phenotype similarities and 
genetic similarities. 

For the categorical SNVs data, we can use either the IBS 
function or the weighted IBS function to measure the genetic 
similarity ( |Lynch and Ritl and, 1999]). Assuming the genetic 
variants (pi, m , M) are coded as 0, 1 and 

2, respectively for AA, Aa and aa, we can define the IBS-based 
genetic similarity between subjects i and j as, 


K 


~ 2M E 2 l5i ’ m 


9j,m | 


Alternatively, we can use a weighted-IBS (wIBS) genetic 
similarity to emphasize the effects of rare variants, 


K .J 


iIBS 


E 


i i,m 9j,m |) 


where w m represents the weight for the m-th SNV in the SNV- 
set, and T is a scaling constant, defined as T = 2 w m . 

Wm is usually defined as a function of minor allele frequency 
(MAF, denoted as 7 m ). For example, we can define the 
weight w m as w m = l/\/ 7 m(l — 7 m) • When the genetic 
data are count data or continuous data, we can use other 
forms of /(■,•) to measure genetic similarity (e.g., euclidian 
distance based similarity), which, we will leave for further 
investigations. 

For phenotype similarity, we define a unified measurement 
for both categorical and continuous phenotypes based on a 
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normal quantile. For each phenotype, yi (1 ^ l ^ L), we 
define the corresponding normal quantile by, 

qi,i = "F -1 ((rank(y it i) - 0.5 )/n), 

where rankf ) corresponds to the rank of the phenotype value 
yij. When there are ties, we assign the averaged rank. <F -1 (-) 
is the inverse cumulative density function for a standard 
normal distribution. We can then calculate the Euclidian 
Distance (ED) based phenotype similarity between subjects i 
and j by, 

L 

sff = exp(— Y ui(qi,i - 

i=i 

where coi represents the weight for the Z-th phenotypes given 
based on prior knowledge. If there is no prior knowledge, we 
can use an equal weight, toi = 1/L. The ED-based phenotype 
similarity can be easily modified to take the correlation among 
the phenotypes into account, 

Sff = exp - qj) T V(qi - q^ , 

where qi = (qn,-" ,qiL) T - T can be chosen to reflect the 
correlations among the phenotypes. For example, we can 
define F as, 

r = (^E®«?’r 1 - 

i=1 


Other than the ED based phenotype similarity, we can define 


the phenotype similarity using cross product!Tzeng et al. 


2009). The types of phenotype similarity or genetic similarity 


can be chosen based on different purposes, which may influ¬ 
ence the power of the method. For simplicity, we used an 
ED-based phenotype similarity in this paper. 


2.3 Hypothesis testing 

Based on the definition of the centered similarity, we can 
show that E(f(Gi,Gj)) = 0 and E{h(Yi,Yj )) = 0 (Appendix 
A). Under the null hypothesis, when the genetic variants are 
not associated with multiple phenotypes, we have E(U) = 0 
(Appendix A). Under the alternative hypothesis, when the 
genetic variants are associated with multiple phenotypes, we 
expect that the phenotype similarity is concordant with the 
genetic similarity. In other words, the positive phenotype 
similarities are weighted heavier and the negative phenotype 
similarities are weighted lighter, leading to a positive value 
of U statistic. A statistical test can be formed to test the 
association and its p-value can be calculated by P(U > U 0 b s ), 
where U 0 b a is the observed value of U. Tests based on 
permutation or bootstrap can be implemented to obtain 
the statistical significance. Nevertheless, both methods 
are computationally expensive for high-dimensional data. 
Therefore, we derive the asymptotic distribution of GSU to 
assess the statistical significance of the association test. 


By considering the genetic similarity as the weight func¬ 
tion and the phenotype similarity as the U kernel, GSU 


is a weighted U statistic! Gregory 

1977 

Shapiro and Hu- 

bert| 

1979, 

O’Neil and Redner, 1993; |Shieh[ |1997; Lindsay 

et al. 

2008 

). More specifically, because its kernel satisfied 


Var(E(h(Yi,Y 2 )\Y 2 )) = 0 (Appendix A), GSU is a degener¬ 
ated weighted U statistic. To derive the limiting distribution 
of GSU, we can decompose the centered phenotype similarity 
by, 

oo 

KvuVt) = ^2j\scj>s{yi)(l)s{y2), 

s= 1 

where {A s } and {0 S (-)} are eigenvalues and eigenfunctions of 
the U kennel h(- , •), and all the eigenfunctions are orthogonal, 

[^s{yi)(t)s'{yi)dF{y 1 ) = l 0 ’ lis ^ s , 

J [1, lf s = s . 

Similarly, we can decompose the centered genetic similarity 
by, 

oo 

f(Gi, Gj) = Y qtvt{g i)pt(p2). 

t =1 

We can then rewrite the GSU as, 


. oo oo / n 

u = ^iEE sign(rjt\ s ) ( ,— ^ (Yi) 

1 t=i s=l V E i=1 

1 OO OO 1 n 

-^lEE sign(r)tX 3 )- 

t=l S =1 i= 1 

where q>*(Gi) = |??t| 0 ' 5 ^t(Gi) and 0*(Yj) = |A S | 0 ' 5 <(> S (Y) 
(Appendix B). 

Using the form above, we can show that the limiting 
distribution of GSU is a weighted sum of independent chi- 
square random variables. This is the result of theorem 1 below, 
which is proved in Appendix B. 

Theorem 1: Suppose E(h 2 (Y 1 ,Y 2 )) < oo, 

E(f 2 (Gi,G 2 )) < oo, and Y T G. Let h{Y u Y 2 ) and 

f{Gi,G 2 ) be the centered similarities as defined in |i]). 
Define U as U = £*,■ f(Gi, G,)MY, Yj). Then, 

nU A T,'ZLiVtJ27=i X s('X 2 st - !)> where ix 2 st} are 
independent chi-square random variables with 1 degree 
of freedom. 

2.4 The power and sample size calculation 
In this subsection, we derive the asymptotic distribution of 
GSU under the alternative hypothesis, and provide power 
and sample size calculations for sequencing association 
studies. 


Assume under the alternative hypothesis 
that E(f(G 1 ,G 2 )h(Y 1 ,Y 2 )) = p, > 0 and 

Var(f(Gi,G 2 )h(Y 1 ,Y 2 )\(G 2 ,Y 2 )) = Cr > 0 , where p 
measures the strength of the association. It is easy to show 
that GSU is an unbiased estimator of p, 


E{U) 


1 

n(n — 1) 


Y E (f{Gi, Gj)h(Yi, Yj)) 


= A'- 


Using the Hoeffding projection, we can show that GSU 
asymptotically follows a normal distribution, with mean p 
and variance 4£i /n. This is the result of Theorem 2, which is 
proved in Appendix C. 
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Theorem 2: Let h(Yi,Y 2 ) and f(Gi,G 2 ) be the cen¬ 
tered similarities as defined in 0- Suppose Y is asso¬ 
ciated with G, and the following conditions are satisfied: 
E(f(G 1 ,G 2 )h{Yi,Y 2 )) = p > 0, Var(f(G u G? 2 )ft(Yi, Y 2 )) = 
Co < oo, and Var(f(Gi, G 2 )h(Y\, Y 2 )\{G 2 , Y 2 )) = Ci > 0. 
Define U as U = /(Gi, G^hiY, Y-). Then, 

Vn{U — n ) N ( 0 , 4(i). 


The power of GSU at the significance level a can be 
calculated by, 


P{nU > qi- a } 

=p f Vn(U- p) 




=H 


2\/Ci 
np — qi- a 
2\/nCi 


> 


qi-c 


np 




), 


where qi- a is the 1 — a quantile for r/t A s (x«t — 1) 
and <f>(-) is the CDF of a standard normal distribution. The 
sample size required to achieve power p can be calculated by 
solving <f>( ^ p. By denoting Zp as the p quantile 

for a standard normal distribution, the required sample size 
is given by, 


n = min 

nSN 



(W£ + + pqi-c'j ] 

p 2 j ' 


2.5 Computation and implementation 

Let S = {Sij} nX n an d K = {Kij} nxn be the matrix form of 
the phenotype similarity and genetic similarity, the centered 
similarity matrices S and K can be obtained by, 


S = (I - J)S(I - J), 


K = (/ - J)K{I - J), 


where I is an n-by-n identity matrix, and J is an n-by-n matrix 
with all elements being 1/n (Appendix D). Then GSU can be 
expressed as, 


U = 


1 

n(n — 1) 




In this form, U can be viewed as a sum of the element-wise 
product of the two matrices, Ko and So, which are obtained 
by assigning 0 to the diagonal elements of matrices I\ and 
S. We then use matrix eigen-decomposition to approximate 
the eigen-values in function decomposition. Let {A s } and 
{fjt } respectively be the eigen-values for matrices Ko and So 
(Appendix E), the limiting distribution of U is given by, 


Z—' n. z —/ ri 


n 

t = 1 S— 1 


3. Simulation study 

3.1 Simulation method 


To mimic real genetic structure, we used genetic data from 
the 1000 Genome Project!Abecasis et al. 2010). Based on 
the genetic data, we then simulated phenotype values. In 
particular, we used a 1Mb region of the genome (Chromosome 
17: 7344328-8344327) from the 1000 Genome Project. For 
each simulation replicate, we randomly chose a 30kb segment 
from the 1Mb region and formed a SNV-set for the analysis. 
From the SNV-set, we set a proportion of the SNVs as causal. 
A number of individuals were randomly chosen from the total 
1092 individuals as the simulation sample to study the perfor¬ 
mance of the methods. To investigate the robustness against 
different phenotype distributions, we simulated three types 
of phenotypes: a binary-distributed phenotype, a Gaussian- 
distributed phenotype and a Cauchy-distributed phenotype. 
The binary-distributed phenotype was simulated by using a 
logistic regression model, 


logit(P(Yi = 1 ))=p + Gjp, 


where Y, and Gi were the phenotype value and the genotype 
vector (coded as 0, 1, and 2) for the i-th individual, respec¬ 
tively. P were the effects of the SNVs, which were sampled 
from a uniform distribution with a mean of pp and a variance 
of erj|. The Gaussian-distributed phenotype was simulated by 
using the linear regression model, 

Yi = p -f- Gi P + Si , Si ~ A r (0, c ). 


The Cauchy-distributed phenotype mimicked a situation in 
which the continuous phenotype had more extreme values 
(i.e. heavy-tailed), and was simulated by using the following 
model, 

Yi ~ cauchy(ai, b), ai = p + Gj p, 


where ai and b are the location parameter and the scale 
parameter of the Cauchy distribution, respectively. 

Two sets of simulation were performed. In simulation I, 
we considered a single phenotype, while in simulation II, we 
considered multiple-phenotypes. 


3.1.1 Simulation I Setting. Under the null, the models 
were simulated by setting pp = 0 and = 0, and by 
varying the sample size from 50 to 500 (i.e. 50, 100, 200 and 
500). Under the alternative, two sets of disease models were 
simulated: 

(1) Set pp = 0 and ap > 0 so that half of the causal SNVs 
were deleterious and the other half were protective. 

(2) Set pp > 0 and a 2 p > 0 so that majority of the causal 
SNVs were deleterious. 

The details of the simulation setting can be found in Table 
SI of Supplementary Materials. 


3.1.2 Simulation II Setting. Two sets of models were sim¬ 
ulated: 


where {Xst} ar e independent chi-square random variables 
with 1 degree of freedom. The p-value can be calculated by us¬ 


ing the Davies’ method (Davies 19801, the Liu’s method!Liu 
|et al.|[2009|) or the Kuonen’s method! Kuonen| |1999|). 


(1) Assume the multiple phenotypes follow the same distri¬ 
bution. In particular, we simulated 3 binary-distributed 
phenotypes (BBB), 3 Gaussian-distributed phenotypes 
(GGG), and 3 Cauchy-distributed phenotypes (CCC). 























Generalized Similarity U 


5 


(2) Assume the multiple phenotypes follow different dis¬ 
tributions. In particular, we simulated 3 pheno¬ 
types with 2 binary-distributed phenotypes and 1 
Gaussian-distributed phenotype (BBG), 3 phenotypes 
with 1 binary-distributed phenotypes and 2 Gaussian- 
distributed phenotypes (BGG), and 3 phenotypes with 
1 binary-distributed phenotype, 1 Gaussian distributed 
phenotype and 1 Cauchy distributed phenotype (BGC). 

Similar as the simulation I, for the null model, we set pp = 0 
and erjg = 0 . For the alternative models, we set up = 0 and 
<jp > 0 , and allowed the multiple phenotypes to be influenced 
by different sets of causal SNVs. The details of the simulation 
setting were described in Table S2 of Supplementary Materi¬ 
als. 


3.2 Simulation result 

We evaluated the performance of GSU by comparing it with 
three existing methods, SKAT( Wu et ah] 20111, AdjSKAT 
and SKATO(Lee et al. 20121. For each simulation, we created 
1000 simulation replicates to evaluate type 1 error and power. 
For GSU, we used the wIBS-based genetic similarity and ED- 
based phenotype similarity to construct the U statistic. To 
be consistent, we used the same weighted IBS to construct 
the kernel for SKAT. Because neither AdjSKAT nor SKATO 
had an option for a weighted IBS kernel, we used the default 
kernel, the weighted linear kernel, when applying these two 
methods. SKAT, AdjSKAT and SKATO are designed for 
univariate analysis. To consider multiple phenotypes, we chose 
the most significant p-value from the univariate analysis, 
and then adjust for multiple tests by using the Bonferroni 
correction. 


3.2.1 Result for Simulation I. The type I error rates of 
the 4 methods are summarized in Table [I] GSU had a well- 
controlled type I error, regardless of phenotype distributions 
and sample sizes. Neverthless, SKAT, AdjSKAT and SKATO 
had inflated type I error rates (ranging from 0.101 to 0.19) 
for the Cauchy-distributed phenotype. When the sample size 
was small (e.g., 50 or 100), SKAT and SKATO also had con¬ 
servative type I error (e.g., 0.001) for the binary-distributed 
phenotype. 

The power comparison is summarized in Figures [l] and [2] 
For the disease model where half of the causal SNVs were dele¬ 
terious (Figure[l]), GSU had higher power than other methods 
for various sample sizes ranging from 50 to 500. For in¬ 
stance, in the simulation with binary phenotype and a sample 
size of 50, GSU (power=0.346) attained much higher power 
than AdjSKAT (power=0.114), SKATO (power=0.057), and 
SKAT (power=0.024). Similarly, for the Gaussian-distributed 
phenotype, GSU had the highest power among the four 
methods, and the three SKAT-based methods had similar 
performance. For the Cauchy-distributed phenotype, where 
the distribution assumption was violated, the power of the 
SKAT-based methods remained similar as the sample sizes 
increased, while GSU had increased power as the sample size 
increased. For the second disease model in which a majority of 
the SNVs were deleterious (Figure [2|, we observed the same 
conclusion, i.e., GSU had highest power regardless of sample 
sizes and phenotype distributions. 


Table 1 


Type I error comparison for the univariate analysis 


Sample size 

Method 


Distribution 




Binary 

Gaussian 

Cauchy 

50 

SKAT 

0.001 

0.030 

0.122 


SKATO 

0.021 

0.041 

0.101 


AdjSKAT 

0.054 

0.028 

0.123 


GSU 

0.058 

0.046 

0.051 

100 

SKAT 

0.014 

0.028 

0.149 


SKATO 

0.035 

0.038 

0.120 


AdjSKAT 

0.063 

0.027 

0.139 


GSU 

0.046 

0.050 

0.053 

200 

SKAT 

0.023 

0.040 

0.155 


SKATO 

0.024 

0.042 

0.140 


AdjSKAT 

0.043 

0.042 

0.156 


GSU 

0.057 

0.048 

0.045 

500 

SKAT 

0.039 

0.055 

0.190 


SKATO 

0.052 

0.056 

0.158 


AdjSKAT 

0.049 

0.053 

0.180 


GSU 

0.038 

0.046 

0.043 


Intuitively, as a semi-parametric model, SKAT should have 
comparable power with GSU. The reason for the advantage 
of GSU lies in the form of test statistics. The score test of 
SKAT utilizes both diagonal terms and off-diagonal terms 
of the similarity matrix. When the null model gives the 
same predicted values (i.e., no covariates), the diagonal terms 
essentially provide no additional information with regard to 
association, while adding more variation (i.e., noise) to the 
score statistic. Meanwhile, GSU utilizes only the off-diagonal 
terms. When the sample size is small, the influence of diagonal 
terms will be significant, which is why we observed the higher 
power of GSU over SKAT. When the sample size becomes 
larger, the off-diagonal terms will dominate, and the difference 
of power of two methods becomes smaller. 


3.2.2 Result for Simulation II. The type I error rates for 
the multivariate analysis are summarized in Table [2] Similar 
to the results of the univariate analysis, GSU can correctly 
control type I error at the level of 0.05 (Table [2|, while 
the other three methods had inflated type I error when the 
distribution assumption was violated (e.g., CCC and BGC). 
When the distribution assumption was satished, AdjSKAT 
correctly controlled the type 1 error. SKAT and SKATO, 
however, had conservative type I error rates for small sample 
sizes (50 and 100), especially when the multiple phenotypes 
comprised of binary-distributed phenotypes (e.g., BBB, BBG 
and BGG). 

The power comparison of four methods under the two 
disease models is summerized in Figures [3] and [4] For the dis¬ 
ease model where the multiple phenotypes followed the same 
distribution, GSU had higher power than the three SKAT- 
based methods (Figure [3j. For simulations with BBB phe¬ 
notypes and GGG phenotypes, GSU attained higher power 
than the other three methods when the sample size was 
50, and obtained substantial power improvement as sample 
sizes increased. For the simulation with CCC phenotypes, we 
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Figure 1. Power comparison for univariate analysis when a majority of causal SNVs are deleterious 
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Figure 2. Power comparison for univariate analysis when half of causal SNVs are protective and the other half of causal 
SNVs are deleterious 


also observed substantial power improvement of GSU than 
the other three methods as sample size increased. Yet, the 
power of the three SKAT-based methods were higher than 
that of GSU when the sample size was 50. This can be 
explained by the inflated type 1 error rates of the SKAT-based 
methods for the Cauchy-distributed phenotypes (Table [2|. 
Similarly, for disease models with a combination of different 
types of phenotype distribution (i.e., BBG, BGG, and BGC), 
GSU had higher power than SKAT, SKATO and AdjSKAT, 


regardless of sample size and phenotype distributions (Figure 



Comparing with univariate analysis in simulation I, GSU 
had further power improvement by considering the association 
of multi-phenotype in one single test. The three SKAT-based 
methods, on the other hand, performed multiple univariate 
analyses with Bonferroni correction. Under the alternative, 
the multiple phenotypes are correlated to each other. The 
univariate analysis with Bonferroni correction is subject to 
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Table 2 

Type I error comparison for multivariate analysis 


Sample size 

Method 

Same Distr. 

a 

Diff Distr. 



BBB 

GGG 

ccc 

BBG 

BGG 

BGC 

50 

SKAT 

0.002 

0.028 

0.232 

0.008 

0.021 

0.097 


SKATO 

0.019 

0.027 

0.207 

0.016 

0.024 

0.085 


AdjSKAT 

0.046 

0.028 

0.237 

0.049 

0.040 

0.113 


GSU 

0.057 

0.047 

0.045 

0.034 

0.043 

0.056 

100 

SKAT 

0.001 

0.023 

0.295 

0.011 

0.013 

0.122 


SKATO 

0.025 

0.039 

0.260 

0.025 

0.036 

0.117 


AdjSKAT 

0.055 

0.025 

0.290 

0.059 

0.042 

0.135 


GSU 

0.048 

0.047 

0.044 

0.051 

0.059 

0.058 

200 

SKAT 

0.020 

0.048 

0.325 

0.033 

0.020 

0.134 


SKATO 

0.030 

0.058 

0.288 

0.035 

0.036 

0.118 


AdjSKAT 

0.059 

0.046 

0.319 

0.064 

0.038 

0.136 


GSU 

0.050 

0.047 

0.049 

0.058 

0.056 

0.043 

500 

SKAT 

0.044 

0.035 

0.364 

0.035 

0.04 

0.166 


SKATO 

0.054 

0.036 

0.315 

0.038 

0.047 

0.156 


AdjSKAT 

0.06 

0.038 

0.342 

0.041 

0.049 

0.172 


GSU 

0.045 

0.054 

0.044 

0.046 

0.039 

0.062 


a B, G, and C represent Binary-distributed, Gaussian-distributed, and Cauchy-distributed pheno¬ 
types, respectively. 


power loss without considering the correction structure of 
multiple phenotypes. Meanwhile, GSU can take the correla¬ 
tion structure into account when constructing the phenotype 
similarity, and therefore attained higher power than the other 
three methods. 


3.3 Computational Efficiency 

We compared the computational efficiency of R-based GSU 
with SKAT, SKATO and AdjSKAT. We found GSU attained 
highest computational efficiency among four methods. For 
example, for the multivariate analysis of BBG phenotype with 
a sample size of 500 on 1000 simulation replicates, SKAT, 
SKATO and AdjSKAT took 5 (4320s), 16 (13686s) and 25 
(20890s) times longer than GSU (849s), using a personal 
computer with 2.3GHz CPU and 4G memory. The detailed 
computational efficiency comparison of four methods for ana¬ 
lyzing data with various distributed phenotypes and different 
sample sizes is given in Table S3 of Supplementary Materials. 

Although the comparisons were made under simulation 
setting, they represent general scenarios in practice. All four 
methods need to spend substantial time on eigen decomposi¬ 
tion in the final step of calculating asymptotic distribution. 
For multivariate analysis, GSU performs the eigen decompo¬ 
sition once for all phenotypes, while SKAT-based methods 
need to perform the eigen decomposition for each phenotype. 
Moreover, if the link function is not identity link, iterative 
estimation for null model and additional large matrix multi¬ 
plication and decomposition are required before the final eigen 
decomposition, which will also increase the computational 
burden of SKAT-based methods. 


4. Application to the Dallas Heart Study 

To evaluate the performance of GSU on real data, we applied 


it to the Dallas Heart Study (DHS) sequence data(Ahituv 


et al. 2007) and compared the result of GSU with those 


of three SKAT-based methods. The DHS sequencing data 
is comprised of 4 genes, ANGPTL3, ANGPTL4, ANGPTL5 
and ANGPTL6. After completing the quality control (e.g., 
removing SNVs with high missing rate), 230 SNVs (54 SNVs, 
63 SNVs, 61 SNVs, and 52 SNVs were from ANGPTL3, 
ANGPTL4, ANGPTL5 and ANGPTL6, respectively) and 
2598 subjects remained for the analysis. In the real data 
analysis, we were interested in testing the association of the 
SNVs in these genes with multiple metabolic-related phe¬ 
notypes, including obesity (dichotomized from BMI using a 
cut-off of 35), cholesterol, higli-density lipoprotein cholesterol 
(HDL), low-density lipoprotein cholesterol (LDL) and very- 
low-density lipoprotein cholesterol (VLDL). 

In order to consider the potential confounding effects of 
age, gender and race, we adjusted these covariates in the 
analysis and used the residuals to build phenotype similarity 
for GSU. Because the three SKAT-based methods cannot 
directly analyze multiple phenotypes, we obtained smallest p- 
values from the univariate analysis of all phenotypes and then 
used the Bonferroni correction to determine whether there 
was a significant association. The results are summarized 
in Table [3] Because all 4 genes were metablolic candidate 
genes, we first combined SNVs in the 4 genes into a single 
SNV-set. For the joint analysis of 4 genes, GSU could de¬ 
tect a significant association of 4 genes with 4 metabolic- 
related phenotypes (p-value=0.028), while SKAT, SKATO 
and AdjSKAT did not detect the association. For further 
exploration, we analyzed each gene separately and tested its 
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Figure 3. 


Power comparison for the multivariate analysis when the multiple phenotypes follow the same type of distribution 
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Figure 4. Power comparison for the multivariate analysis when the multiple phenotypes follow the different types of 
distributions 


association with 4 metabolic-related phenotypes. By using 
GSU, we found a marginal association of ANGPTL4 with 
the multiple phenotypes(p-value=0.057). 

5. Discussion 

Driven by recent developments in sequencing technologies 
and the common-diseases-rare-variants hypothesis, sequenc¬ 
ing studies are now emerging as a major study design for the 


genetic association of complex diseases. Yet, the uniqueness 
of sequencing data, including low MAF and high dimension¬ 
ality, pose daunting challenges to statistical analysis. The 
conventional methods, such as a single-locus analysis using 
logistic regression, had low power for sequencing data anal¬ 
ysis. Instead, joint association analysis, or SNV-set analysis, 
is becoming popular due to its ability to increase power and 
reduce multiple testing. Although the existing joint associa¬ 
tion methods have nice features and are easy to use, they also 
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Table 3 

The multivariate analysis of f metabolic-related phenotypes in the DHS study 


Gene 

Rare variants P-value 

a 



GSU SKAT AdjSKAT SKATO 


ANGPTL3 

54 

0.300 

0.083 

0.071 

0.112 

ANGPTL4 

63 

0.057 

0.079 

0.086 

0.155 

ANGPTL5 

61 

0.075 

0.046 

0.073 

0.107 

ANGPTL6 

52 

0.297 

0.059 

0.065 

0.101 

ALL 

230 

0.028 

0.250 

0.174 

0.245 


a Multiple phenotypes considered in the analysis include BMI, cholesterol, HDL, LDL and VLDL. 
For SKAT-based methods, each p-value listed in the table is the smallest p-value from 5 univariate 
analysis, where the Bonferroni corrected significance level should be 0.05/5. 


have certain limitations. For example, most of burden tests 
assume homogeneous effects within the SNV-set, which may 
not reflect the true underlying disease models. Besides burden 
tests, most of the existing methods are parametric or semi- 
perimetric, which often assume certain distributions of the 
phenotypes and mode of inheritance. When the assumptions 
are violated, the results can be unreliable. 

To overcome these limitations, we have proposed a gen¬ 
eralized similarity U test for multivariate analysis of differ¬ 
ent types of phenotypes. We conducted extensive simula¬ 
tion studies using data from the 1000 Genome Project and 
compared the performance of GSU with that of three other 
popular methods: SKAT, AdjSKAT and SKATO. For all 
of the simulation scenarios, including single phenotype with 
various distributions, and multiple phenotypes with various 
combinations of phenotype distribution, GSU outperformed 
the other three methods in terms of robust type I error and 
higher power. Although the simulation results depend on the 
simulation settings, and should always be interpreted in the 
context of the simulation setting, we believe the results reflect 
the advantage of GSU in a broader sense, because 1) the 
genetic data used in the simulation comes from the 1000 
Genome Project, which reflects the LD pattern and the allelic 
frequency distribution in the general population; and 2) we 
simulated a wide range of disease models to mimic common 
disease scenarios. 

In recent years, U-statistic-based methods have been pop¬ 
ularly used in genetic data analysis, and have shown their 


robustness and flexibility for analyzing genetic data( Schaid 


et al. 2005 Zhang et al. 2010 Wei et al. 2013). GSU 


is a general framework based on similarity measurements. 
Although in this paper we used the weighted-IBS to calculate 
genetic similarity, other forms of genetic similarity can also 
be used. For the weighted-IBS similarity, we can modify the 
weights (i.e., the original weight w m = — 7m)) 

to reflect the importance of each SNV. Besides IBS-based 
similarity, we can also use distance-transformed similarity 
to model the effect in a nonlinear way. For example, we 
can use the Euclidean-Distance-based similarity, Kff = 
exp(— Wm(3i,m-ffj,m) 2 )- In this paper, we have focused 

on the analysis of categorical sequencing data (i.e, SNV data). 
By using appropriate genetic similarity measurements, GSU 
can easily be extended to analyze other types of genetic 


data, such as count data (CNV data) and continuous data 
(expression data). 

The flexibility of GSU also comes from the construction 
of the phenotype similarity. By using phenotype similarity, 
GSU can analyze not only a single phenotype, but also 
multiple phenotypes following different distributions. Many 
genetic studies collect multiple secondary phenotypes, or use 
intermediate biomarkers, to study complex diseases. By con¬ 
sidering multiple phenotypes that measure the different as¬ 
pects of underlying diseases, the power of association analysis 


can be potentially improved (Zhang et al. 2010 Liu et al. 


2009 Maity et al. 2012). Nevertheless, few methods have 


been developed for multivariate analysis of sequencing data. 
Methods were recently developed for multivariate analysis of 
common genetic variants, but relies on the normal assumption 


for the phenotype distribution (Maity et al. 2012). GSU 


is developed for both univariate and multivariate analysis 
of sequencing data. It is distribution-free and can analyze 
multiple phenotypes with a combination of different types of 
phenotype. Nevertheless, similar as all other non-parametric 
tests, it is not straightforward for GSU to perform covariate 
adjustment. One possible solution is to adopt the idea of 
covariate adjustment in regression, and include covariates 
when we calculate the centered similarity (e.g., S = (I — 
X(X T X)~ 1 X T )S(I - X{X T X)~ 1 X T )). 


Supplementary Materials 

Supplementary Materials are available with this paper at the 
Biometrics website on Wiley Online Library. 
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Using the same argument, we have the corresponding results 
for the decomposition of the centered genetic similarity: 

OO 

f(Gi, Gj) = ^77t<Msi)<M32). 

t =i 


Appendix A: Centered Similarity 

By the following definition of centered similarity, 

h(yi,V2) = h(yi,y 2 ) - E(h{yi,Y 2 )) 

—E(h(Y 1 ,y 2 )) + E{h(Y ll Y 2 )) 

we can obtain conditional expectation for the centered phe¬ 
notype similarity, 

A(h,(y 1 ,y 2 )|Ui) = s(/i(y 1 ,y 2 |yi))-s(/i(yi,y 2 |yi)) 
-E(h{Y 1 ,Y 2 )) + E{h{Y 1 ,Y 2 )) 

= 0 . 

Therefore, we have E(h(Yi,Y 2 )) = 0 and 

Var(E(h.(Yi,Y 2 )\Yi)) = 0. Using the same argument, 

we can have the same result for the centered genetic 
similarity. 

Under the null hypothesis, when the genetic similarities are 
independent of the phenotype similarities, we have, 

E (U ) = A ^(E hGi, Gj)h(Yi, Yj)) 

= -Az TV EG J ))A(ft(y, Yj)) 

1 ' iAi 

= 0 . 


Appendix B: Proof of Theorem 1 
We can decompose the centered phenotype similarity by, 

OO 

h{yi, V 2 .) = E A B 0 s (2/i)<A s (y 2 ), 

S— 1 

where {A s } and {</> s (-)} are eigenvalues and eigenfunctions of 
the kennel h (-, •). 

Because of the orthogonality of {<)> s (-)} , we have 
E{h(Y 1 ,Y 2 )cl> s '(Y2)\Y 1 ) 

= j h{Yi,y 2 )<j> s < ( y 2 )dF(y 2 ) 

OO p 

=E A ^) 

s =1 ^ 

= \s>ts'(Yl). 

We showed E(h(Yi,Y 2 ) x l|yf) = 0 x 1 in Appendix A, which 
forced <j> i(-) = 1 and Ai = 0 to be an eigenfunction-eigenvalue 
pair in the decomposition of h(- , ■). Again, because </>i(-) is 
orthogonal with )} s >i, for s > 1, we have 

EM* l) = J Mvi) x ldF (yi) 

= J 4>s{yi)<l>i{yi)dF(y 1 ) 

= o. 


Then, 


E<t> s (Y) = 0, Vs > 1 
Eifi t (G) = 0 Vt > 1. 


(Ax.l) 


Using the function decomposition, we can write GSU as, 
1 


U = 


n {n — 1 ) j 


E/(G I ,G j )h(y,y}) 


i Ai 


n(A~T) £ £ ytMGJMGi) E A s MYi)MYj) 

^ ' izA 7 t =1 s=l 


i^j t= 

OO OO 


n{n — 1) 


E^ E A * E <Pt(Gi)<Pt(Gj)MYi)MYj) 


t =1 s=l 


^ OO OO / ^ 10 


72—1 


t =2 s=2 

OO OO 


- T E * E A » - E (MGi)MYi)f 

n — * ■* ' n * J 


72—1 zz ' 72 

t=2 s=2 i=l 


1 ft 

sign(rj t A s ) ( /77 ^ (Gi)<l>8 (XQ 

t=2 s=2 \V z=l 


I OO OO . iii 

-^iEE signet\ s ) n (7/t (£2*)0s O'*)) ? 

i=2 s=2 i=l 

where <p*(Gi) = |??i| 0 ' 5 ^t(Gi) and </>J(y) = |A s |°- 5 ^ 3 (y). 

Under the null hypothesis, genotype (G;) is independent of 
phenotypes (Yj). Therefore, for s > 1 and t > 1, 

£(%*(GiK(yi)) 

=|r, t |°- 5 ^ t (Gi)|A s |°' 5 ^ s (yi) 

=0, (Ax.2) 


and 


E{rit(G 1 )ti(Y 1 )rit'(G 1 )<fa(yi)) 

=\rit\ s rit' \ s '\°' 5 E(<pt(Gi)ip t ' (Gi))E(<j> B (Yi)<j> B i(Yi)) 

|r; t A s |, if s = s'and t = t' 


0, 


otherwise. 


(Ax.3) 


Therefore, for any finite subset A of {(s, t)} s >i,t>i, 


{^E"=i vUgmuy)} 


converges to a multivariate 

(s,t)e a 

normal distribution by using results from equation |Ax.2| 
equation |Ax.3| and multivariate CLT. 

Additionally, we can show that, 


E E{r^(Gi)<f>*s(Y i)) 2 

S> 1, t >1 

=EWEM 

s t 

<oo. 


Under the condition E s >i t>i Eljjt{Gi)4>t(Yi)) 2 < oo, the 
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countable sequence of function {rj ?(•)'/’*(')} is a Donsker 
class! van der Vaart and Wellner[ 20001. Therefore, the em- 
pirical process, ^ XilLi Vt (Gi)4>s(Yi), converges weakly to 
the Gaussian process Z Si t with mean zero and covariance 
function, 


cov ( Z ., t , Z s ,, f ) = S(^(Gi)^(yi)^(Gi)^(Fi)) 

J |r?tAs|, if s = s'and t = t' 

1 0, otherwise. 

With this uniform convergence (for all s > 1 and t > 1), we 
can show that, 

oo oo 

nU A sign(rj t \ s ){Z s _ t ) 2 

t =2 s =2 
oo oo 

-EE sign(r]t\s)\rit^s\ 

t =2 s=2 
oo oo 

= E^E^x-- 1 )’ 

t= 1 S = 1 

where Xst are i-i-d chi-squared random variables with a d.f. of 

1. 


Appendix C: Proof of Theorm 2 

To simplify the notation, we denote X = (Y. G) and 
m(A'i,A' 2 ) = /(Gi, G 2 )h(Yi, Y 2 ). GSU can then be rewritten 
as: 

Define a centered kernel u(xi,X 2 ) by: 

u(x i,x 2 ) = u(a:i,a; 2 ) - «i(a;i) - in(a: 2 ) - yu, 

where ui(x) = E(u(X i, A 2 )|AT = x). 

We can decompose the GSU as follows: 


U = 




i(n — 1) 




, 1 n E (*(*<> *i) + «iPG) + M*i) - y) 

n(n — 1) z —' 

1 9 " 

n?n — E *(*-*;) + n E - a*) + **• 

V i =1 


Thus, 


^ ” M) = Jh E - /*) + A E *(*> ■ x i )■ 

V i=l ' ' 

Becuase -E(-ui(A')) = yr and Var(u i(A')) = £i, the first term 
converges to normal distribution by applying CLT: 


-4EMa*)-m) Aa(0,4Ci). 


Then we need to show: 

\/n 


V ' i?U 


This can be done by proving ER 2 —> 0, using the 

fact that £(u(AT,A 2 )) = 0, Var(u ( AT, X 2 )) < oo and 
E(u( Ai, A 2 )|A'i) = 0. In fact, by using the similar technique 
in Appendix C, we can show that y/nR asymptotically follows 
the distribution of a weighted sum of independent chi-square 
random variables. 


Appendix D: Matrix Similarity 

In the study sample, we can calculate the centered phenotype 
similarity by: 

1 n | n i 

Hyi,yj) = »)- - E w)- - E Mi/t »)+^ E 

j = l i= 1 i,j 

Denote Sij = h(yi,yj) and Si t j = h(yi,yj), the above 
equation becomes: 

n n 

3= 1 »=1 i,j 

The equations can be written in a matrix form, 

S = S-JS-SJ+ JSJ 
= (■ I-J)S(I-J ), 

where S = {h(yi,yj)} n xn, S = {h(yi,y ;)}„*», I = 

{1 {i=j}}nXni and J — |-} n xn. 

Similarly, the centered genetic similarity can also be written 
in the matrix form: 

K m (I - J)K(I - J), 

where K = {f{gu 9j)}nxn, and K = {/(ffi, flj)}nxn- 


Appendix E: Limiting Distribution 


In the actual computation, we will use a matrix eigen- 
decomposition to obtain the eigenvectors as a finite-dimension 
approximation of the eigenfunctions. For a matrix eigen- 
decomposition, a computer algorithm usually gives the eigen¬ 
value X s with the eigenvector (j> 3 , which satisfies 0s,* = 1 

instead of i4*s,i — 1- Therefore, using the eigenvalues 

As and rjt calculated from the matrix eigen-decomposition, 
the limiting distribution of GSU is: 


nU~ E-E-(X 

' T). ' 71. 


t= 1 S=1 
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